Minimizing Detail Data in Data Warehouses
Abstract
Data warehouses collect and maintain large amounts of data from several distributed and heterogeneous data sources. For reasons of security, operational requirements, and technical feasibility, it is often impossible for data warehouses to access the data sources directly. Instead, data warehouses have to replicate legacy information as detail data in order to maintain their summary data. In this paper we investigate how to minimize the amount of detail data stored in a data warehouse. More specifically, we identify the minimal amount of data that has to be replicated in order to maintain, either incrementally or by recomputation, summary data defined in terms of generalized project-select-join (GPSJ) views. We show how to minimize the number of tuples and attributes in the current detail tables and even aggregate them where possible. The amount of data to be stored in current detail tables is minimized by exploiting smart duplicate compression in addition to local and join reductions. We identify situations where it becomes possible to omit the typically huge fact table, and we prove that these techniques in concert ensure that the current detail data is minimal, in the sense that no subset of it permits accurate maintenance of the same summary data. Finally, we sketch how existing maintenance methods can be adapted to use the minimal detail tables we propose.
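As a rough illustration of the duplicate-compression idea mentioned in the abstract, the following Python sketch projects a fact table onto only the attributes a summary view needs and folds identical tuples into a count. It is not the paper's algorithm: the `SALES` rows, the attribute names, and the functions `compress`, `summary_from_detail`, and `summary_from_source` are all hypothetical, and the view is assumed to be a simple `SUM(price)`, `COUNT(*)` grouped by product.

```python
from collections import defaultdict

# Hypothetical source rows: (sale_id, product, store, price).
SALES = [
    (1, "pen", "north", 2.0),
    (2, "pen", "north", 2.0),
    (3, "pen", "south", 2.5),
    (4, "ink", "north", 7.0),
    (5, "pen", "north", 2.0),
]


def compress(rows):
    """Keep only (product, price) and fold duplicate tuples into a count."""
    detail = defaultdict(int)
    for _sale_id, product, _store, price in rows:
        detail[(product, price)] += 1
    return dict(detail)


def summary_from_detail(detail):
    """Recompute SUM(price) and COUNT(*) per product from compressed detail."""
    acc = defaultdict(lambda: [0.0, 0])
    for (product, price), n in detail.items():
        acc[product][0] += price * n  # each stored tuple stands for n rows
        acc[product][1] += n
    return {p: (total, count) for p, (total, count) in acc.items()}


def summary_from_source(rows):
    """Reference result computed directly from the uncompressed source rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for _sale_id, product, _store, price in rows:
        acc[product][0] += price
        acc[product][1] += 1
    return {p: (total, count) for p, (total, count) in acc.items()}


if __name__ == "__main__":
    detail = compress(SALES)
    print(len(SALES), "source rows ->", len(detail), "detail tuples")
    print(summary_from_detail(detail) == summary_from_source(SALES))
```

In this toy case the compressed detail table drops the `sale_id` and `store` attributes and stores fewer tuples than the source, yet the assumed summary view can still be recomputed from it.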
Similar resources
Specification-Based Data Reduction in Dimensional Data Warehouses
Many data warehouses contain massive amounts of data and grow rapidly. Examples include warehouses with retail sales data capturing customer behavior and warehouses with click-stream data capturing user behavior on web sites. The sheer size of these warehouses makes them increasingly hard to manage and query efficiently. As time passes, old, detailed data in the warehouses tend to become less i...
WARLOCK: A Data Allocation Tool for Parallel Warehouses
We present the WARLOCK tool, which automatically determines the allocation of a parallel data warehouse to disk. This GUI-equipped tool is implemented in Java and uses an internal cost model and heuristics to determine a disk allocation that minimizes both I/O work and query response times. WARLOCK recommends a ranked list of fragmentation candidates, a detailed query performance analysis and a tailored p...
Solving a mathematical model with multi warehouses and retailers in distribution network by a simulated annealing algorithm
Determining shipment quantities and distribution is an important subject in today’s business. This paper describes an inventory/distribution network design. The system addresses a class of distribution network design problems characterized by multiple product families, multiple warehouses, and multiple retailers. The maximum capacities of vehicles and warehouses are also known. The res...
Order Fulfillment in Online Retailing: What Goes Where
We present three problems motivated by order fulfillment in online retailing. First, we focus on one warehouse or fulfillment center. To optimize the storage space and labor, an e-tailer splits the warehouse into two regions with different storage densities. One is for picking customer orders and the other to hold a reserve stock that replenishes the picking area. Consequently, the warehouse is...